PSCI 8357 - STAT II
Department of Political Science, Vanderbilt University
February 3, 2026
This week we will see that regression can also be used agnostically to estimate causal estimands.
BUT this only solves the estimation problem.
Problem: if we want to learn about the relationship between \(X\) and \(Y\), we need a way to summarize how \(Y\) varies with \(X\).
CEF
The CEF, \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\), is the expected value of \(Y_i\) across values of \(X_i\):
For continuous \(Y_i\): \[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \int_{\mathcal{Y}} y f(y {\:\vert\:}X_i) \, dy \]
For discrete \(Y_i\): \[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = \sum_{y \in \mathcal{Y}} y\, p(y {\:\vert\:}X_i) \]
CEF Decomposition Property
\[ Y_i = \underbrace{{\mathbb{E}}[Y_i {\:\vert\:}X_i]}_{\text{explained by $X_i$}} + \underbrace{\varepsilon_i}_{\text{unexplained}}, \]
where \({\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] = 0\) and \(\varepsilon_i\) is uncorrelated with any function of \(X_i\)
To see this property, recall
\[ \begin{align*} \varepsilon_i &= Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] \quad \implies\\ {\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i] &= {\mathbb{E}}[Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] {\:\vert\:}X_i] = 0 \end{align*} \]
also \({\mathbb{E}}[h(X_i) \varepsilon_i] = 0\). (How can we use Law of Iterated Expectations to prove this?)
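One way to see it, sketched: condition on \(X_i\) first, then apply the Law of Iterated Expectations:

\[ \begin{align*} {\mathbb{E}}[h(X_i) \varepsilon_i] &= {\mathbb{E}}\big[{\mathbb{E}}[h(X_i) \varepsilon_i {\:\vert\:}X_i]\big] \quad \text{($\because$ LIE)} \\ &= {\mathbb{E}}\big[h(X_i)\, {\mathbb{E}}[\varepsilon_i {\:\vert\:}X_i]\big] = {\mathbb{E}}[h(X_i) \cdot 0] = 0 \end{align*} \]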
CEF Prediction Property
\[ {\mathbb{E}}[Y_i {\:\vert\:}X_i] = {\arg\!\min}_{g(X_i)} {\mathbb{E}}\left[ (Y_i - g(X_i))^2 \right], \] where \(g(X_i)\) is any function of \(X_i\).
\[ \begin{align*} (Y_i - g(X_i))^2 &= \left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i] + {\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2 \\ &= \left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)^2 + 2\left(Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]\right)\left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right) \\ &\quad + \left({\mathbb{E}}[Y_i {\:\vert\:}X_i] - g(X_i)\right)^2. \end{align*} \]
The first term does not depend on \(g\), and the cross term has zero expectation by the decomposition property, so the MSE is minimized at \(g(X_i) = {\mathbb{E}}[Y_i {\:\vert\:}X_i]\).
Density distributions show the spread of \(Y\) values at each discrete \(X\); black line connects the conditional means.
The CEF properties we just established are important because:
Decomposition: Any outcome can be split into a systematic part (explained by covariates) and noise.
Optimality: The CEF is the best predictor of \(Y_i\) given \(X_i\) in the MSE sense.
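A minimal empirical check of the prediction property, on simulated data with a deliberately nonlinear CEF: the conditional means beat the best linear predictor in MSE.

```r
# check: the (estimated) CEF has smaller MSE than the best linear predictor
set.seed(1)
n <- 10000
X <- sample(1:5, n, replace = TRUE)        # discrete covariate
Y <- X^2 + rnorm(n)                        # nonlinear CEF: E[Y | X] = X^2
cef_hat <- ave(Y, X)                       # conditional mean of Y within each X
mse_cef <- mean((Y - cef_hat)^2)           # MSE of the CEF predictor
mse_lin <- mean(lm(Y ~ X)$residuals^2)     # MSE of the best linear predictor
c(mse_cef = mse_cef, mse_lin = mse_lin)    # CEF attains the smaller MSE
```

In-sample, the group means minimize the sum of squared errors within each stratum of \(X\), so `mse_cef` can never exceed `mse_lin`.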
The quantity \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) looks very familiar: we already used it in \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) and \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\).
We want to see whether regression helps us estimate these quantities, especially differences in means.
Note: There is nothing causal in \({\mathbb{E}}[Y_i {\:\vert\:}T_i]\) or \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\), so we still need identification
Before we move on, we need to recall some important facts about regression coefficients
Population regression coefficients vector is given by (directly follows from \({\mathbb{E}}[X_i \varepsilon_i] = 0\)) \[ \beta = {\mathbb{E}}[X_i X_i^{\prime}]^{-1} {\mathbb{E}}[X_i Y_i] \]
Regression coefficient in single covariate case is given by (population and sample analog) \[ \beta = \frac{{\mathrm{cov}}(Y_i,X_i)}{{\mathbb{V}}(X_i)}, \quad \widehat{\beta} = \frac{\sum_{i = 1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i = 1}^{n} (X_i - \bar{X})^2} \]
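A quick check that the sample-analog formula reproduces the OLS slope (simulated data; names illustrative):

```r
# the cov/var formula equals the slope from lm()
set.seed(2)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
beta_hat <- cov(y, x) / var(x)             # sample analog of cov(Y, X) / V(X)
beta_ols <- unname(coef(lm(y ~ x))["x"])   # OLS slope
all.equal(beta_hat, beta_ols)              # identical up to floating point
```

The \(n-1\) denominators in `cov()` and `var()` cancel, so the two numbers agree exactly.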
Regression coefficient in multiple covariate case is given by \[ \beta_{k} = \frac{{\mathrm{cov}}(\tilde{Y}_i,\tilde{X}_{ki})}{{\mathbb{V}}(\tilde{X}_{ki})}, \] where \(\tilde{X}_{ki}\) is the residual from regressing \(X_k\) on \(X_{-k}\)
Theorem: Linear CEF
If CEF \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\) is linear in \(X_i\), then the population regression function \(X_i^{\prime} \beta\) returns exactly \({\mathbb{E}}[Y_i {\:\vert\:}X_i]\).
To see this property we can
Use decomposition property of CEF to see \({\mathbb{E}}[ X_i (Y_i - {\mathbb{E}}[Y_i {\:\vert\:}X_i]) ] = 0\)
Substitute for \({\mathbb{E}}[Y_i {\:\vert\:}X_i] = X_i^{\prime} b\) and solve
Regression Prediction Property
The function \(X_i' \beta\) provides the minimum-MSE linear approximation to \({\mathbb{E}}[Y_i | X_i]\), that is:
\[ \beta = {\arg\!\min}_b {\mathbb{E}}\left[ ({\mathbb{E}}[Y_i | X_i] - X_i' b)^2 \right]. \]
\[ \begin{align*} (Y_i - X_i' b)^2 &= \left( (Y_i - {\mathbb{E}}[Y_i | X_i]) + ({\mathbb{E}}[Y_i | X_i] - X_i' b) \right)^2 \\ &= (Y_i - {\mathbb{E}}[Y_i | X_i])^2 + ({\mathbb{E}}[Y_i | X_i] - X_i' b)^2 \\ &\quad + 2 (Y_i - {\mathbb{E}}[Y_i | X_i]) ({\mathbb{E}}[Y_i | X_i] - X_i' b). \end{align*} \]
Again the cross term has zero expectation by the decomposition property, so minimizing \({\mathbb{E}}[(Y_i - X_i' b)^2]\) over \(b\) is equivalent to minimizing \({\mathbb{E}}[({\mathbb{E}}[Y_i | X_i] - X_i' b)^2]\).
Suppose \(\mathcal{T} = \{0, 1\}\)
Under SUTVA (no interference and consistency) POs are \(Y_{i} (1)\) and \(Y_{i} (0)\).
A unit-level treatment effect is \(\tau_i = Y_{i} (1) - Y_{i} (0)\)
We observe \(X_i\), \(T_i\), and \(Y_i = T_i Y_{i} (1) + (1 - T_i )Y_{i} (0)\).
In this simple case, the OLS estimator solves the least squares problem:
\[ (\widehat{\tau}, \widehat{\alpha}) = {\arg\!\min}_{\tau, \alpha} \sum_{i=1}^n \left(Y_i - \alpha - \tau T_i\right)^2 \]
The coefficient estimate \(\widehat{\tau}\) is algebraically equivalent to the difference in means (\(\widehat{\tau}_{DiM}\)):
\[ \widehat{\tau} = \bar{Y}_1 - \bar{Y}_0 = \widehat{\tau}_{DiM} \]
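This equivalence is easy to verify numerically (simulated data; names illustrative):

```r
# OLS slope on a binary treatment equals the difference in means
set.seed(3)
n  <- 200
Ti <- rbinom(n, 1, 0.5)                    # avoid `T`, R's shorthand for TRUE
Yi <- 1 + 0.5 * Ti + rnorm(n)
tau_ols <- unname(coef(lm(Yi ~ Ti))["Ti"])
tau_dim <- mean(Yi[Ti == 1]) - mean(Yi[Ti == 0])
all.equal(tau_ols, tau_dim)                # algebraically identical
```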
\[ \begin{align*} Y_i &= T_i Y_i(1) + (1 - T_i) Y_i(0) \\ &= Y_i(0) + T_i ( Y_i(1) - Y_i(0) ) \quad\text{($\because$ distribute)}\\ &= Y_i(0) + \tau_i T_i \quad \text{($\because$ unit treatment definition)}\\ &= {\mathbb{E}}[Y_i(0)] + \tau T_i + ( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (\tau_i - \tau) \quad (\because \pm {\mathbb{E}}[Y_i(0)] + \tau T_i)\\ &= {\mathbb{E}}[Y_i(0)] + \tau T_i + (1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) \quad\text{($\because$ distribute)}\\ &= \alpha + \tau T_i + \eta_i \end{align*} \]
The linear functional form is justified by SUTVA (consistency) alone:
\[ \eta_i = (1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) \]
\[ \begin{align*} {\mathbb{E}}[\eta_i {\:\vert\:}T_i] &= {\mathbb{E}}[(1 - T_i)( Y_i(0) - {\mathbb{E}}[Y_i(0)] ) + T_i (Y_i(1) - {\mathbb{E}}[Y_i(1)]) {\:\vert\:}T_i] \\ &= (1 - T_i) ({\mathbb{E}}[Y_i(0) {\:\vert\:}T_i] - {\mathbb{E}}[Y_i(0)]) + T_i ({\mathbb{E}}[Y_i(1) {\:\vert\:}T_i] - {\mathbb{E}}[Y_i(1)]) \end{align*} \]
Under random assignment, \({\mathbb{E}}[Y_i(t) {\:\vert\:}T_i] = {\mathbb{E}}[Y_i(t)]\), so both terms vanish and \({\mathbb{E}}[\eta_i {\:\vert\:}T_i] = 0\).
Randomization + consistency allow linear model.
Does not imply homoskedasticity or normal errors, though!
Practical implication: Use heteroskedasticity-robust (HC2) standard errors for inference, e.g. via lm_robust().
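lm_robust() (from the estimatr package) reports HC2 standard errors by default. As a sketch of what it computes, the HC2 sandwich can be assembled in base R (simulated data; names illustrative):

```r
# HC2 sandwich estimator by hand: e_i^2 / (1 - h_ii) in the "meat"
set.seed(4)
n  <- 500
Ti <- rbinom(n, 1, 0.5)
Yi <- 1 + 0.5 * Ti + rnorm(n, sd = 1 + Ti)      # heteroskedastic errors
fit <- lm(Yi ~ Ti)
X <- model.matrix(fit)                           # design matrix (intercept, Ti)
e <- residuals(fit)
h <- hatvalues(fit)                              # leverages h_ii
bread <- solve(crossprod(X))                     # (X'X)^{-1}
meat  <- crossprod(X * (e^2 / (1 - h)), X)       # X' diag(e^2 / (1 - h)) X
V_hc2 <- bread %*% meat %*% bread
sqrt(diag(V_hc2))                                # HC2 standard errors
```

With a binary regressor, the HC2 variance for the slope coincides exactly with the Neyman estimator \(s_1^2/n_1 + s_0^2/n_0\), one reason HC2 is the recommended default for experiments.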
We just showed: under strong ignorability (random assignment), regression estimates causal effects.
But what if treatment is not randomly assigned?
This is the selection on observables framework:
Assumption: Conditional Ignorability (CIA)
\[ \{ Y_i(0), Y_i(1) \} {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i \]
Given covariates \(X_i\), treatment assignment is independent of potential outcomes.
Interpretation: Within groups defined by \(X_i\), treatment is effectively randomized.
Assume constant treatment effects: \(\tau_i = \tau\) for all \(i\).
Potential outcomes follow: \(f_i(t) = \alpha + \tau t + \eta_i\)
\[ \eta_i = X_i^{\prime} \gamma + \nu_i, \]
where \(\gamma\) captures the linear relationship between \(X_i\) and outcomes, and \(\nu_i\) is the residual variation.
\[ Y_i = f_i(T_i) = \alpha + \tau T_i + X_i^{\prime} \gamma + \nu_i \]
\[ \nu_i {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i \]
Result: The regression error \(\nu_i\) is conditionally mean-independent of \(T_i\) given \(X_i\), and uncorrelated with the regressors.
Therefore, OLS on \(Y_i = \alpha + \tau T_i + X_i^{\prime} \gamma + \nu_i\) yields consistent estimates.
\[ {\mathbb{E}}[f_i(t) {\:\vert\:}T_i = t, X_i] = {\mathbb{E}}[f_i(t) {\:\vert\:}X_i] = \alpha + \tau t + X_i^{\prime} \gamma \quad \text{($\because$ CIA)} \]
\[ \begin{align*} {\mathbb{E}}[f_i(t) - f_i(t - v) {\:\vert\:}X_i] &= (\alpha + \tau t + X_i^{\prime} \gamma) - (\alpha + \tau (t - v) + X_i^{\prime} \gamma) \\ &= \tau v \end{align*} \]
Key observation: The \(X_i^{\prime} \gamma\) terms cancel out!
What we assumed: conditional ignorability given \(X_i\), constant treatment effects (\(\tau_i = \tau\)), and linearity (\(\eta_i = X_i^{\prime} \gamma + \nu_i\)).
What we achieved: the coefficient on \(T_i\) in the OLS regression of \(Y_i\) on \(T_i\) and \(X_i\) identifies the treatment effect \(\tau\).
Critical question: What happens if we omit important confounders from \(X_i\)?
\[ \begin{align*} {\mathrm{cov}}(Y_i, T_i) &= {\mathrm{cov}}(\alpha + \tau T_i + X_i' \gamma + \nu_i,\, T_i) \\ &= \tau {\mathrm{cov}}(T_i, T_i) + {\mathrm{cov}}(X_{1i} \gamma_1 + \ldots + X_{Ki} \gamma_K, T_i) \\ &= \tau {\mathbb{V}}(T_i) + \gamma_1 {\mathrm{cov}}(X_{1i}, T_i) + \ldots + \gamma_K {\mathrm{cov}}(X_{Ki}, T_i) \end{align*} \]
\[ \implies \frac{{\mathrm{cov}}(Y_i, T_i)}{{\mathbb{V}}(T_i)} = \tau + \underbrace{\gamma^{\prime} \delta}_{\text{OVB}} \]
where \(\delta\) are coefficients from regressions of \(X_1, \ldots, X_K\) on \(T_i\).
OVB = \(\gamma^\prime \delta\), where \(\gamma\) collects the effects of the omitted variables on the outcome and \(\delta\) collects the coefficients from regressing each omitted \(X_k\) on \(T_i\).
Same holds when we consider the case where we include some controls:
\[ \text{OVB} = \tilde{\gamma}' \tilde{\delta}. \]
Everything is just defined in terms of variables that have been residualized with respect to the included controls.
OVB = confounder impact \(\times\) imbalance (Cinelli and Hazlett 2020).
Let’s practice applying the OVB formula:
OVB = \((X_{ki}, Y_i)\) relationships \(\times\) \((X_{ki}, T_i)\) relationships
Effect of democratic institutions on growth, estimated via regression of growth on democratic institutions.
Effect of exposure to negative advertisements on turnout, estimated via regression of turnout on the number of ads seen.
set.seed(20250127) # set seed
n <- 1000 # sample size
tau <- 0.5 # ATE
gamma <- 0.3 # effect of confounder on outcome
delta <- 0.3 # effect of confounder on treatment
# confounder
confounder <- rnorm(n, mean = 50, sd = 10)
# democratic institutions (correlated with confounder)
democracy_score <- delta * confounder + rnorm(n, mean = 0, sd = 5)
# economic growth (influenced by both the confounder and democratic institutions)
growth <- tau * democracy_score + gamma * confounder + rnorm(n, mean = 0, sd = 5)
# long regression including the confounder
model_unbiased <- lm(growth ~ democracy_score + confounder)
cat("Unbiased model error:", unname(model_unbiased$coefficients[2]) - tau, "\n")
# short regression omitting the confounder
model_biased <- lm(growth ~ democracy_score)
cat("Biased model error:", unname(model_biased$coefficients[2]) - tau, "\n")
Unbiased model error: -0.01922573
Biased model error: 0.3081032
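The sample analog of the OVB formula holds exactly: the gap between the short and long coefficients equals \(\widehat{\gamma}\,\widehat{\delta}\). A self-contained check (fresh simulation; names illustrative):

```r
# in-sample OVB identity: short = long + gamma_hat * delta_hat
set.seed(5)
n <- 1000
conf  <- rnorm(n, mean = 50, sd = 10)               # confounder
treat <- 0.3 * conf + rnorm(n, sd = 5)              # treatment
y     <- 0.5 * treat + 0.3 * conf + rnorm(n, sd = 5)
long  <- coef(lm(y ~ treat + conf))                 # long regression
short <- coef(lm(y ~ treat))["treat"]               # short regression
delta_hat <- coef(lm(conf ~ treat))["treat"]        # omitted on included
ovb <- unname(short - long["treat"])                # bias in the short slope
all.equal(ovb, unname(long["conf"] * delta_hat))    # gamma_hat * delta_hat
```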
“Omitted variables” is a misleading term because it could suggest that you should include any variable that is correlated with treatment and outcome.
But remember that bad controls exist, e.g. post-treatment variables (mediators) and colliders.
The discussion of OVB suggests that we can use regression to adjust for variables (\(X_i\)) to estimate the treatment effect (\(\tau\)) in two ways.
Long regression: Include covariates \(X_i\) directly in the regression model.
Residualized regression: regress \(T_i\) on \(X_i\), take the residuals \(\tilde{T}_i\), and regress \(Y_i\) on \(\tilde{T}_i\).
Result: Coefficient on \(T_{i}\) in long regression and on \(\tilde{T}_i\) in residualized regression are identical.
Nodes: \(T\), \(Y\), \(Z_1\), \(Z_2\), and \(Z_3\).
Paths: \(T \to Y\), \(T \leftarrow Z_3 \to Y\), \(T \leftarrow Z_1 \to Z_3 \leftarrow Z_2 \to Y\), etc.
\(Z_1\) is a parent of \(T\) and \(Z_3\).
\(T\) and \(Z_3\) are children of \(Z_1\).
\(Z_1\) is an ancestor of \(Y\).
\(Y\) is a descendant of \(Z_1\).
Definition: Types of Paths
A causal (front-door) path from \(T\) to \(Y\) is a path where every arrow points away from \(T\) toward \(Y\): \(T \to \cdots \to Y\)
A back-door path from \(T\) to \(Y\) is any path that starts with an arrow into \(T\): \(T \leftarrow \cdots\)
Causal paths transmit the effect of \(T\) on \(Y\) — we want to keep these open!
Back-door paths create spurious associations (confounding) — we want to block these.
Example: \(T \to W \to Y\) is a causal path. \(T \leftarrow Z \to Y\) is a back-door path.
Intuition: Back-door paths are “alternative explanations” for why \(T\) and \(Y\) might be correlated, even if \(T\) has no causal effect on \(Y\).
Key insight: Colliders have special properties: a collider blocks the path it sits on by default, but conditioning on the collider (or one of its descendants) opens the path.
Example: \(Z_3\) is a collider on the path \(Z_1 \to Z_3 \leftarrow Z_2\).
Definition: Blocked Paths
A path \(p\) is blocked by a set of nodes \(X\) if: (1) \(p\) contains a chain \(A \to B \to C\) or a fork \(A \leftarrow B \to C\) whose middle node \(B\) is in \(X\); or (2) \(p\) contains a collider \(A \to B \leftarrow C\) such that neither \(B\) nor any of its descendants is in \(X\).
Definition: \(d\)-separation
A set \(X\) \(d\)-separates \(T\) and \(Y\) if \(X\) blocks all paths between \(T\) and \(Y\).
If \(X\) \(d\)-separates \(T\) and \(Y\), then \(Y {\mbox{$\perp\!\!\!\perp$}}T {\:\vert\:}X\).
Theorem: The Back-Door Criterion
A set \(X\) satisfies the back-door criterion relative to \((T, Y)\) if: (1) \(X\) blocks every back-door path between \(T\) and \(Y\); and (2) no node in \(X\) is a descendant of \(T\).
Why condition 1? Blocking back-door paths eliminates confounding — the spurious association between \(T\) and \(Y\).
Why condition 2? Descendants of \(T\) are post-treatment variables. Conditioning on them could block part of the causal effect (if they’re mediators), and/or induce collider/selection bias (if they’re affected by \(T\) and share causes with \(Y\), etc.).
\[ \begin{align*} {\mathbb{E}}[Y_i(t)] &= {\mathbb{E}}_X[{\mathbb{E}}[Y_i {\:\vert\:}T_i = t, X_i]] \implies \\ \implies\ \tau_{ATE} &= {\mathbb{E}}[Y_i(1)] - {\mathbb{E}}[Y_i(0)] = {\mathbb{E}}_X[{\mathbb{E}}[Y_i {\:\vert\:}T_i = 1, X_i] - {\mathbb{E}}[Y_i {\:\vert\:}T_i = 0, X_i]] \end{align*} \]
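The adjustment formula suggests a plug-in estimator: average stratum-level differences in means over the distribution of \(X_i\). A sketch on simulated data (true ATE set to 1; names illustrative):

```r
# plug-in stratified estimator of the ATE
set.seed(6)
n  <- 5000
X  <- sample(1:3, n, replace = TRUE)                # discrete covariate
Ti <- rbinom(n, 1, c(0.3, 0.5, 0.7)[X])             # propensity varies by stratum
Yi <- 1 + Ti + 0.5 * X + rnorm(n)                   # constant effect: ATE = 1
tau_x   <- sapply(1:3, function(x)                  # stratum-level DiM
  mean(Yi[Ti == 1 & X == x]) - mean(Yi[Ti == 0 & X == x]))
tau_hat <- sum(tau_x * as.numeric(table(X)) / n)    # average over Pr(X = x)
tau_hat                                             # close to the true ATE of 1
```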
Follow Cinelli, Forney, and Pearl (2024), which provides a systematic framework for thinking about control variables.
Key insight: Not all variables that are correlated with treatment and outcome should be controlled for!
We will classify controls as: good (reduce bias or improve precision), neutral, or bad (introduce bias or hurt precision).
In model (a) reduction in variation is good! \(\rightarrow\) higher precision
In model (b) reduction in variation is bad! \(\rightarrow\) lower precision
In model (c) reduction in variation is good again! \(\rightarrow\) higher precision
In models (a) and (b) controlling for \(Z\) unblocks back-door paths and induces a spurious relationship between \(X\) and \(Y\).
In models (c) and (d) controlling for \(Z\) will unblock the back-door path \(X \rightarrow Z \leftarrow U \rightarrow Y\).
In models (a) and (b) controlling for \(Z\) blocks the causal path.
In model (c) controlling for \(Z\) blocks part of the causal path.
In model (d) controlling for \(Z\) will not block the causal path or induce any bias.
To see the intuition behind post-treatment bias consider the following example
Suppose \(X \in \{0, 1\}\) is randomly assigned, and then
\[ \begin{align*} Z &= X + \varepsilon_Z, \\ Y &= \beta X + \gamma Z + \varepsilon_Y, \end{align*} \]
where \(\varepsilon_Z\) and \(\varepsilon_Y\) are independent standard normal draws.
Substituting in \(Y\):
\[ Y = (\beta + \gamma)X + \gamma \varepsilon_Z + \varepsilon_Y \]
The total effect of \(X\) on \(Y\) is \(\beta + \gamma\).
Controlling for \(Z\), we would estimate an effect of \(\beta\).
The bias, \(-\gamma\), is the portion of the effect that has been “stolen away” by conditioning on \(Z\).
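This cancellation is easy to verify by simulation (a minimal sketch with \(\beta = 1\), \(\gamma = 2\)):

```r
# post-treatment bias: conditioning on Z "steals" the mediated effect
set.seed(7)
n <- 10000
X <- rbinom(n, 1, 0.5)                   # randomized treatment
Z <- X + rnorm(n)                        # post-treatment variable
Y <- 1 * X + 2 * Z + rnorm(n)            # beta = 1, gamma = 2
unname(coef(lm(Y ~ X))["X"])             # close to beta + gamma = 3 (total effect)
unname(coef(lm(Y ~ X + Z))["X"])         # close to beta = 1: gamma "stolen" by Z
```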
Be mindful of what controls you include in your analysis (even if it is an experiment).
Draw a DAG with the controls you plan to include and check whether they block all back-door paths without blocking causal paths or opening collider paths.
Also be mindful of the magnitudes of the effects of potential confounders: if a confounder's effects on both the treatment and the outcome can be argued to be small, the OVB is small!
Thus far we assumed constant effects (\(\tau_i = \tau\)) and linearity (\({\mathbb{E}}[\eta_i {\:\vert\:}X_i] = X'_i \gamma\)).
These are strong assumptions! What happens if treatment effects vary across units?
Setup with heterogeneous effects: \(\tau_i = Y_i(1) - Y_i(0)\) may differ across units.
Maintain conditional ignorability (CIA): \(\{ Y_{i}(0), Y_{i}(1) \} {\mbox{$\perp\!\!\!\perp$}}T_i {\:\vert\:}X_i\)
Goal: Estimate \(\tau_{ATE} = {\mathbb{E}}[\tau_i]\) using regression.
Under CIA, the ATE can be written as a weighted average of conditional ATEs:
\[ \begin{align*} \tau_{ATE} &= {\mathbb{E}}[\tau_i] = {\mathbb{E}}_{X} [{\mathbb{E}}[Y_i(1) - Y_i(0) {\:\vert\:}X_i]] \\ &= {\mathbb{E}}_{X} [\underbrace{{\mathbb{E}}[Y_i(1) {\:\vert\:}X_i] - {\mathbb{E}}[Y_i(0) {\:\vert\:}X_i]}_{\tau_x}] \\ &= \sum_{x} \tau_x {\textrm{Pr}}(X_i = x), \end{align*} \]
where \(\tau_x \equiv {\mathbb{E}}[Y_i(1) - Y_i(0) {\:\vert\:}X_i = x]\) is the CATE for stratum \(x\).
To flexibly control for \(X_i\), use a saturated regression (fixed effects for each \(X_i\) value):
\[ Y_i = \alpha_1 \mathbb{1}[X_i = x_1] + \cdots + \alpha_L \mathbb{1}[X_i = x_L] + \tau T_i + \varepsilon_i, \]
where \(\mathbb{1}[\cdot]\) is the indicator function. (One \(\alpha\) omitted if including intercept.)
Why saturated? With a separate dummy for every value of \(X_i\), the model places no functional-form restriction on how the controls enter: the CEF of the controls is linear in the dummies by construction.
Regression anatomy is: \(\widehat{\tau} = \frac{{\mathrm{cov}}(\tilde{Y}_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)} = \frac{{\mathrm{cov}}(Y_i, \tilde{T_i})}{{\mathbb{V}}(\tilde{T_i})}\), where \(\tilde{T}_i\) is residuals from regression of \(T_i\) on other regressors.
Let’s see if it actually works
# simulate data
n <- 1000
X <- rnorm(n)
D <- 0.5 * X + rnorm(n) # named D to avoid `T`, R's shorthand for TRUE
Y <- 2 * D + 1 * X + rnorm(n)
# standard regression
standard <- coef(lm(Y ~ D + X))["D"]
# make Y tilde and D tilde
tilde_Y <- lm(Y ~ X)$residuals
tilde_D <- lm(D ~ X)$residuals
# regression anatomy
anatomy <- coef(lm(tilde_Y ~ tilde_D))["tilde_D"]
# simplified regression anatomy
anatomy_simp <- coef(lm(Y ~ tilde_D))["tilde_D"]
data.frame(
Method = c("Standard", "Regression Anatomy",
"Regression Anatomy (Simplified)"),
Coefficient = c(standard, anatomy, anatomy_simp)
) |>
knitr::kable(digits = 3)

| Method | Coefficient |
|---|---|
| Standard | 1.978 |
| Regression Anatomy | 1.978 |
| Regression Anatomy (Simplified) | 1.978 |
Define the residualized treatment: \(\tilde{T}_i \equiv T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]\)
Key property: \({\mathbb{E}}[\tilde{T}_i] = 0\) (residuals have mean zero)
Start from regression anatomy and simplify the covariance:
\[ \begin{align*} \widehat{\tau} &= \frac{{\mathrm{cov}}(Y_i, \tilde{T}_i)}{{\mathbb{V}}(\tilde{T}_i)} = \frac{{\mathbb{E}}[Y_i \tilde{T}_i] - {\mathbb{E}}[Y_i]\textcolor{#d65d0e}{{\mathbb{E}}[\tilde{T}_i]}}{{\mathbb{E}}[\tilde{T}_i^2]} \\ &= \frac{{\mathbb{E}}[Y_i \tilde{T}_i]}{{\mathbb{E}}[\tilde{T}_i^2]} \quad \text{($\because$ ${\mathbb{E}}[\tilde{T}_i] = 0$)} \end{align*} \]
Apply law of iterated expectations to the numerator:
\[ {\mathbb{E}}[Y_i \tilde{T}_i] = {\mathbb{E}}\big[{\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i] \tilde{T}_i\big] \]
This works because \(\tilde{T}_i\) is a function of \(T_i\) and \(X_i\) alone, so it can be pulled inside the conditional expectation given \((T_i, X_i)\).
Expand \({\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i]\) using the switching equation: \(Y_i = T_i Y_i(1) + (1-T_i) Y_i(0)\)
\[ \begin{align*} {\mathbb{E}}[Y_i {\:\vert\:}T_i, X_i] &= T_i {\mathbb{E}}[Y_{i}(1) {\:\vert\:}T_i, X_i] + (1-T_i) {\mathbb{E}}[Y_{i}(0) {\:\vert\:}T_i, X_i] \\ &= T_i {\mathbb{E}}[Y_{i}(1) {\:\vert\:}X_i] + (1-T_i) {\mathbb{E}}[Y_{i}(0) {\:\vert\:}X_i] \quad \text{($\because$ CIA)}\\ &= T_i \big({\mathbb{E}}[Y_{i}(1) {\:\vert\:}X_i] - {\mathbb{E}}[Y_{i}(0) {\:\vert\:}X_i]\big) + {\mathbb{E}}[Y_{i}(0) {\:\vert\:}X_i] \quad \text{($\because$ rearrange)}\\ &= T_i \tau_x + {\mathbb{E}}[Y_i(0) {\:\vert\:}X_i] \end{align*} \]
\[ \begin{align*} \widehat{\tau} &= \frac{{\mathbb{E}}\big[(T_i \tau_x + {\mathbb{E}}[Y_i(0) {\:\vert\:}X_i]) \tilde{T}_i\big]}{{\mathbb{E}}[\tilde{T}_i^2]} \\ &= \frac{{\mathbb{E}}[T_i \tau_x \tilde{T}_i] + {\mathbb{E}}[{\mathbb{E}}[Y_i(0) {\:\vert\:}X_i] \tilde{T}_i]}{{\mathbb{E}}[\tilde{T}_i^2]} \quad \text{($\because$ distribute)} \\ &= \frac{{\mathbb{E}}[T_i \tau_x \tilde{T}_i]}{{\mathbb{E}}[\tilde{T}_i^2]} \quad \text{($\because$ ${\mathbb{E}}[{\mathbb{E}}[Y_i(0) {\:\vert\:}X_i] \tilde{T}_i] = 0$ since ${\mathbb{E}}[\tilde{T}_i {\:\vert\:}X_i] = 0$)} \\ &= \frac{{\mathbb{E}}_X[\tau_x {\mathbb{E}}[T_i \tilde{T}_i {\:\vert\:}X_i]]}{{\mathbb{E}}_X[{\mathbb{E}}[\tilde{T}_i^2 {\:\vert\:}X_i]]} \quad \text{($\because$ law of iterated expectations)} \\ &= \frac{{\mathbb{E}}_X[\tau_x {\mathbb{E}}[T_i (T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i]) {\:\vert\:}X_i]]}{{\mathbb{E}}_X[{\mathbb{E}}[(T_i - {\mathbb{E}}[T_i {\:\vert\:}X_i])^2 {\:\vert\:}X_i]]} \quad \text{($\because$ definition of $\tilde{T}_i$)} \\ &= \frac{{\mathbb{E}}_X[\tau_x {\mathbb{V}}(T_i {\:\vert\:}X_i)]}{{\mathbb{E}}_X[{\mathbb{V}}(T_i {\:\vert\:}X_i)]} \quad \text{($\because$ ${\mathbb{E}}[T_i \tilde{T}_i {\:\vert\:}X_i] = {\mathbb{V}}(T_i {\:\vert\:}X_i)$)} \end{align*} \]
Compare the target vs. what OLS estimates:
\[ \tau_{ATE} = \sum_{x} \tau_x {\textrm{Pr}}(X_i = x), \]
versus (in binary \(T_i\) case)
\[ \widehat{\tau} \xrightarrow{p} \frac{{\mathbb{E}}_X[\tau_x {\mathbb{V}}(T_i {\:\vert\:}X_i)]}{{\mathbb{E}}_X[{\mathbb{V}}(T_i {\:\vert\:}X_i)]} = \frac{\sum_x \tau_x \textcolor{#d65d0e}{p_x(1-p_x)} {\textrm{Pr}}(X_i = x)}{\sum_x \textcolor{#d65d0e}{p_x(1-p_x)} {\textrm{Pr}}(X_i = x)} \]
where \(p_x = {\textrm{Pr}}(T_i = 1 {\:\vert\:}X_i = x)\).
\(\widehat{\tau}\) aggregates the \(\tau_x\) using conditional-variance weights \({\mathbb{V}}(T_i {\:\vert\:}X_i)\) rather than just population shares \({\textrm{Pr}}(X_i = x)\).
If \(\tau_x\) were constant across \(X_i\), regression recovers the ATE, but variance weighting could still reduce efficiency (more uncertainty).
If \(T_i {\mbox{$\perp\!\!\!\perp$}}X_i\), then \(p_x(1-p_x)\) is constant across strata and cancels out, so \(\widehat{\tau}\) reduces to weighting by \({\textrm{Pr}}(X_i = x)\).
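The binary-treatment result has an exact finite-sample counterpart (cf. Angrist 1998): with a saturated model in \(X_i\), the OLS coefficient equals the \(\widehat{p}_x(1-\widehat{p}_x)\)-weighted average of stratum-level differences in means. A sketch on simulated data:

```r
# variance weighting as an exact in-sample identity
set.seed(8)
n  <- 5000
X  <- sample(1:3, n, replace = TRUE)
Ti <- rbinom(n, 1, c(0.2, 0.5, 0.8)[X])      # propensity varies across strata
Yi <- (1 + X) * Ti + X + rnorm(n)            # heterogeneous CATEs: tau_x = 1 + x
tau_ols <- unname(coef(lm(Yi ~ Ti + factor(X)))["Ti"])
# variance-weighted average of stratum-level differences in means
stats <- sapply(1:3, function(x) {
  tx <- Ti[X == x]; yx <- Yi[X == x]; p <- mean(tx)
  c(tau = mean(yx[tx == 1]) - mean(yx[tx == 0]),
    w   = sum(X == x) * p * (1 - p))         # n_x * p_x_hat * (1 - p_x_hat)
})
tau_vw <- sum(stats["tau", ] * stats["w", ]) / sum(stats["w", ])
all.equal(tau_ols, tau_vw)                   # exact algebraic identity
```

Since the strata with \(\widehat{p}_x\) near 0.5 get the most weight, `tau_ols` need not match the population-share-weighted ATE when the \(\tau_x\) vary.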
Logic carries through to continuous treatments (Angrist and Pischke 2009, 77–80; Aronow and Samii 2016).
Aronow and Samii (2016) show that for arbitrary \(T_i\) and \(X_i\),
\[ \widehat{\tau} \xrightarrow{p} \frac{{\mathbb{E}}[w_i \tau_i]}{{\mathbb{E}}[w_i]}, \quad \text{where } w_i = (T_i - {\mathbb{E}}[T_i | X_i])^2, \]
in which case
\[ {\mathbb{E}}[w_i | X_i] = {\mathbb{V}}[T_i {\:\vert\:}X_i]. \]
The effective sample is weighted by \(\widehat{w}_i = (T_i - \widehat{{\mathbb{E}}}[T_i | X_i])^2\) (squared residual from regression of \(T_i\) on covariates).
Even with a representative sample, regression estimates may not aggregate effects in a representative manner. Regression estimates are local to an effective sample.
set.seed(20250202) # set seed
n <- 1000 # sample size
tau_base <- 0.5
gamma <- 0.1 # effect of X on outcome
# some discrete covariate
X <- sample(x = 1:100, size = n, replace = TRUE)
# population ATE: average of the CATEs (tau_base + 0.01 * x) over x = 1:100
tau_total <- sum((tau_base + 0.01 * 1:100) / 100)
# treatment 1: assigned independently of X
democracy_high <- rbinom(n, size = 1, prob = .5)
# treatment 2: propensity increases with X (kept inside (0, 1))
democracy_high_2 <- rbinom(n, size = 1, prob = 0.3 + 0.005 * X)
# economic growth with heterogeneous effects: CATE = tau_base + 0.01 * X
growth <- (tau_base + 0.01 * X) * democracy_high + gamma * X + rnorm(n, mean = 0, sd = 5)
growth_2 <- (tau_base + 0.01 * X) * democracy_high_2 + gamma * X + rnorm(n, mean = 0, sd = 5)
# saturated regression, treatment independent of X: recovers the ATE
bias1 <- lm(growth ~ democracy_high + factor(X))$coefficients[2] - tau_total
# saturated regression, propensity varying with X: variance-weighted estimand
bias2 <- lm(growth_2 ~ democracy_high_2 + factor(X))$coefficients[2] - tau_total

Consider a multiple regression model: \(Y_i = \alpha + \tau T_i + X_i^\prime \gamma + \nu_i\).
To find \(\widehat{\tau}\), the coefficient on \(T_i\), the Frisch-Waugh-Lovell Theorem states that:
Regress \(Y_i\) on \(X_i\) and obtain the residuals \(\tilde{Y}_i = Y_i - X_i^\prime \widehat{\pi}_Y\).
Regress \(T_i\) on \(X_i\) and obtain the residuals \(\tilde{T}_i = T_i - X_i^\prime \widehat{\pi}_T\).
Regress \(\tilde{Y}_i\) on \(\tilde{T}_i\) to obtain \(\widehat{\tau}\).
In addition, the residuals from step 3 are identical to those from the full regression; standard errors match once the degrees of freedom are adjusted for the partialled-out covariates.
Intuition: \(\widehat{\tau}\) is the relationship between the part of \(T_i\) not explained by \(X_i\) and the part of \(Y_i\) not explained by \(X_i\); regression "partials out" the covariates.
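A quick check of both FWL identities, coefficient and residuals (simulated data; names illustrative):

```r
# Frisch-Waugh-Lovell: residualized regression reproduces the full model
set.seed(9)
n <- 300
X <- rnorm(n)
D <- 0.5 * X + rnorm(n)                    # "treatment"
Y <- 2 * D + X + rnorm(n)
tilde_Y <- residuals(lm(Y ~ X))            # step 1
tilde_D <- residuals(lm(D ~ X))            # step 2
fwl  <- lm(tilde_Y ~ tilde_D)              # step 3
full <- lm(Y ~ D + X)
all.equal(unname(coef(fwl)["tilde_D"]), unname(coef(full)["D"]))  # same slope
all.equal(unname(residuals(fwl)), unname(residuals(full)))        # same residuals
```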